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ABSTRACT 

In many astronomical problems one often needs to determine the upper and/or lower bound- 
ary of a given data set. An automatic and objective approach consists in fitting the data using 
a generalised least-squares method, where the function to be minimized is defined to handle 
asymmetrically the data at both sides of the boundary. In order to minimise the cost func- 
tion, a numerical approach, based on the popular DOWNHILL simplex method, is employed. 
The procedure is valid for any numerically computable function. Simple polynomials provide 
good boundaries in common situations. For data exhibiting a complex behaviour, the use of 
adaptive splines gives excellent results. Since the described method is sensitive to extreme 
data points, the simultaneous introduction of error weighting and the flexibility of allowing 
some points to fall outside of the fitted frontier, supplies the parameters that help to tune 
the boundary fitting depending on the nature of the considered problem. Two simple exam- 
ples are presented, namely the estimation of spectra pseudo-continuum and the segregation of 
scattered data into ranges. The normalisation of the data ranges prior to the fitting computa- 
tion typically reduces both the numerical errors and the number of iterations required during 
the iterative minimisation procedure. 

Key words: methods: data analysis - methods: numerical. 



1 INTRODUCTION 

Astronomers usually face, in their daily work, the need of de- 
termining the boundary of some data sets. Common examples 
are the computation of frontiers segregating regions in diagrams 
(e.g. colour-colour plots), or the estimation of reasonable pseudo- 
continua of spectra. Using for illustration the latter example, sev- 
eral strategies are initially feasible in order to get an analytical de- 
termination of that boundary. One can, for example, fit a simple 
polynomial to the general trend of the considered spectrum, mask- 
ing previously disturbing spectroscopic features, such as important 
emission lines or deep absorption characteristics. Since this fit tra- 
verses the data, it must be shifted upwards a reasonable amount in 
order to be placed on top of the spectrum. However, since there 
is no reason to expect the pseudo-continuum following exactly the 
same functional form as the polynomial fitted through the spec- 
trum, that shift does not necessarily provides the expected answer. 
As an alternative, one can also force the polynomial to pass over 
some special data points, which are selected to guide (actually to 
force) the fit through the apparent upper envelope of the spectrum. 
With this last method the result can be too much dependent on the 
subjectively selected points. In any case, the technique requires the 
additional effort of determining those special points. 

With the aim of obtaining an objective determination of the 
boundaries, an automatic approach, based on a generalisation of 



the popular least-squares method, is presented in this work. Sec- 
tion [2] describes the procedure in the general case. As an example, 
the boundary fitting using simple polynomials is included in this 
section. Considering that these simple polynomials are not always 
flexible enough, Section [3] presents the use of adaptive splines, a 
variation of the typical fit to splines that allows the determination 
of a boundary that smoothly adapts to the data in an iterative way. 
Section [4] shows two practical uses of this technique: the compu- 
tation of spectra pseudo-continuum and the determination of data 
ranges. Since the scatter of the data due to the presence of data 
uncertainties tends to bias the boundary determinations, Section [5] 
analyses the problem and presents a modification of the method 
that allows to confront this situation. Finally Section[6]summarises 
the main conclusions. In addition, Appendix lAl discusses the inclu- 
sion of constraints in the fits, whilst Appendix iBldescribes how the 
normalisation of the data ranges prior to the data fitting can help to 
reduce the impact of numerical errors in some circumstances. 

The method described in this work has been implemented into 
the program BoundFit, a FORTRAN code written by the author 
and available (under the GNU General Public Licensqj, version 3) 
at the following URL 

|http : / /www . ucm . es/inf o/Astrof /sof tware/boundf it| 
All the fits presented in this paper have been computed with this 
program. 



E-mail: ncl@astrax.fis.ucm.es 



See license details at http : / /f sf . org 
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Figure 1. Graphical illustration of the asymmetrical weighting scheme de- 
scribed in Section lzTl for the determination of the upper boundary of a par- 
ticular data set. In this example a second-order polynomial is employed. The 
continuous thick line is the traditional (symmetric) ordinary least-squares 
fit for the whole set of data points, which is used as an initial guess for the 
boundary determination. The filled red circles are data points above that fit 
(i.e. outside), whereas the open blue circles are found below such frontier 
(inside). Filled circles receive the extra weighting factor parametrized by 
the asymmetry coefficient J; introduced in Eq. (5). Since this parameter is 
chosen to be § >> 1, the minimisation process shifts the initial fit towards 
the upper region. By iterating the procedure, the final boundary fit, shown 
as the green dashed line, is obtained. The same method, but exchanging 
symbols weights, could be employed to determine the lower boundary limit 
(not shown). 



2 A GENERALISED LEAST-SQUARES METHOD 
2.1 Introducing the asymmetry 

The basic idea behind the method that follows is to introduce, in 
the fitting procedure, an asymmetric role for the data at both sides 
of a given fit, so the points located outside relative to that fit pull 
stronger toward themselves than the points at the opposite side. 
This idea is graphically illustrated in Fig. [T] As it is going to be 
shown, the problem is numerically treatable. In order to use the 
data asymmetrically, it is necessary to start with some initial guess 
fit, that in practice can be obtained employing the traditional least- 
squares method (with a symmetric data treatment). Once this ini- 
tial fit is available, it is straightforward to continue using the data 
asymmetrically and, in an iterative process, determine the sought 
boundary. 

Let's consider the case of a two-dimensional data set consist- 
ing in N points of coordinates (xi,yi), where Xi is an independent 
variable, and yi a dependent variable, which value has an associ- 
ated and known uncertainty Oi . An ordinary error- weighted least- 
squares fit is obtained by minimising the cost function f (also called 
objective function in the literature concerning optimisation strate- 
gies), defined as 



/(a ,ai, . . . 




where y(xi) is the fitted function evaluated at x = Xi, and 
fflO) ax, • • • i CL P are the unknown (p + 1) parameters that define 
such function. Actually, one should write the fitted function as 
y(a , ai, . . . , a p ; x). 

In order to introduce the asymmetric weighting scheme, the 
cost function can be generalised introducing some new coefficients, 
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where a is now a variable exponent (a = 2 in normal least 
squares). For that reason the distance between the fitted function 
y(xi) and the dependent variable yi is considered in absolute value. 
The new overall weighting factors Wi are defined differently de- 
pending on whether one is fitting the upper or the lower boundary. 
More precisely 
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being f3 the exponent that determines how error weighting is in- 
corporated into the fit (/3 — to ignore errors, (5 — 2 in normal 
error-weighted least squares), and £ is defined as an asymmetry co- 
efficient. Obviously, for a — j3 = 2 and f = 1, Eq. {2]l simplifies 
to Eq. Q}. As it is going to be shown later, the asymmetry coeffi- 
cient must satisfy £ >> 1 for the method to provide the required 
boundary fit. 

Leaving apart the particular weighting effect of the data un- 
certainties <Tj, the net outcome of introducing the factors Wi is that 
the points that are classified as being outside from a given frontier 
simply have a higher weight that the points located at the inner side 
(see Fig. QJ, and this difference scales with the particular value of 
the asymmetry coefficient f . 

Thus, the boundary fitting problem reduces to finding the 
(p + 1) parameters do, ax, • • • , a p that minimise Eq. {2]l, subject to 
the weighting scheme defined in Eq. {3}. In the next sections sev- 
eral examples are provided, in which the functional form of y(x) is 
considered to be simple polynomials and splines. 

2.2 Relevant issues 

The method just described is, as defined, very sensitive to extreme 
data points. This fact, that at first sight may be seen as a serious 
problem, it is not necessarily so. For example, one may be inter- 
ested in constraining the scatter exhibited by some measurements 
due to the presence error sources. In this case a good option would 
be to derive the upper and lower frontiers that surround the data, 
and in this scenario there is no need to employ an error-weighting 
scheme (i.e. (3 = would be the appropriate choice). On the other 
hand, there are situations in which the data sample contains some 
points that have larger uncertainties than others, and one wants 
those points to be ignored during the boundary estimation. Under 
this circumstance the role of the /3 parameter in Eq. l[3} is impor- 
tant. Given the relevance of all these issues concerning the impact 
of data uncertainties in the boundary computation, this topic is in- 
tentionally delayed until Section|5] At this point it is better to keep 
the problem in a more simplified version, which facilitates the ex- 
amination of the basic properties of the proposed fitting procedure. 

An interesting generalisation of the boundary fitting method 
described above consists in the incorporation of additional con- 
straints during the minimisation procedure, like forcing the fit to 
pass through some predefined fixed points, or imposing the deriva- 
tives to have some useful values at particular points. A discussion 
about this topic has been included in Appendix [A] 

Another issue of great relevance is the appearance of numeri- 
cal errors during the minimisation procedure. The use of data sets 
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exhibiting values with different orders of magnitude, or with a very 
high number of data points, can be responsible for preventing nu- 
merical methods to provide the expected answers. In some cases a 
simple solution to these problems consists in normalising the data 
ranges prior to the numerical minimisation. A detailed description 
of this approach is presented in Appendix IB"1 



2.3 Example: boundary fitting to simple polynomials 

Returning to Eq. (O, let's consider now the particular case in which 
the functional form of the fitted boundary y(x) is assumed to be a 
simple polynomial of degree p, i.e. 

y(x) = do + a\x + a 2 x 2 + . . . + a p x p . (4) 

In this case, the function to be minimized, f(ao, 01, . . . , a p ), is 
also a simple function of the [p + 1) coefficients. In ordinary least 
squares one simply takes the partial derivatives of the cost func- 
tion / with respect to each of these coefficients, obtaining a set 
of (p + 1) equations with (p + 1) unknowns, which can be eas- 
ily solved, as far as the number of independent points N is large 
enough, i.e. N > p + 1. 

However, considering the special definition of the weighting 
coefficients u>i given in Eq. l[3j, it is clear that in the general case 
an analytical solution cannot be derived without any kind of it- 
erative approach, since during the computation of the considered 
boundary (either upper or lower), the classification of a particular 
data point as being inside or outside relative to a given fit explicitly 
depends on the function y(x) that one is trying to derive. Fortu- 
nately numerical minimisation procedures can provide the sought 
answer i n an easy way. For th is purpose, the DOWNHILL simplex 
method jNelder & Mea3l 19651) is an excellent option. This numer- 
ical procedure performs the minimisation of a function in a multi- 
dimensional space. For this method to be applied, an initial guess 
for the solution must be available. This initial solution, together 
with a characteristic length-scale for each parameter to be fitted, is 
employed to define a simplex (i.e., a multi-dimensional analogue 
of a triangle) in the solution space. The algorithm works using only 
function evaluations (i.e. not requiring the computation of deriva- 
tives), and in each iteration the method improves the previously 
computed solution by modifying one of the vertices of the simplex. 
The simplex adapts itself to the local landscape, and contracts on 
to the final minimum. The numerical procedure is halted once a 
pre-fixed numerical precision in the sought coefficients is reached, 
or when the number of iterations exceeds a pre-defined maximum 
value iVmaxiter. A well-known i mplementation of the DOWNHILL 
simplex method is provided by IPress et al. (20020. For the par- 
ticular case of minimising Eq. ((2} while fitting a simple polyno- 
mial, a reasonable guess for the initial solution is supplied by the 
coefficients of an ordinary least-squares fit to a simple polynomial 
derived by minimising Eq. Q}. 

It is important to highlight that whatever the numerical method 
employed to perform the numerical minimisation, the considered 
cost function will probably exhibit a parameter-space landscape 
with many peaks and valleys. The finding of a solution is never 



2 Since the Numerical Recipes license is too restrictive (the routines cannot 
be distributed as source), the implementation of DOWNHILL included in the 
program BoundFit is a personal version created by the author to avoid 
any legal issue, and as such it is distributed under the GNU General Public 
License, version3. 



a guarantee of having found the right answer, unless one has the re- 
sources to employ brute force to perform a really exhaustive search 
at sufficiently fine sampling of the cost function to find the global 
minimum. In situations where this problem can be serious, more 
robust methods, like t hose provided by gen etic algorithms, must 
be considered (see e.g. lHaupt & Haupt 2004). Fortunately, for the 
particular problems treated in this paper, the simpler DOWNHILL 
method is a good alternative, considering that the ordinary least- 
squares method will likely give a good initial guess for the expected 
solution in most of the cases. 

For illustration, Fig. |2ji displays an example of upper bound- 
ary fitting to a given data set, using a simple 5th order polynomial. 
As initial guess for the numerical minimisation, the ordinary least- 
squares fit for the data (shown with a dashed blue line) has been 
employed. The grey lines represent the corresponding boundary fits 
obtained using the DOWNHILL method previously described. Each 
line corresponds to a pre-defined maximum number of iterations 
A'maxiier in DOWNHILL, as labelled over the lines in the plot inset. 
In this particular example the fitting procedure has been carried 
out without weighting with errors (i.e., assuming j3 = 0), and us- 
ing a power a — 2 and an asymmetry coefficient £ = 1000. It is 
clear that after a few iterations the intermediate fits move upwards 
from the initial guess (dashed blue line), until reaching the location 
marked with -/V max i lel = 31. Beyond this number of iterations, the 
fits move downwards slightly, rapidly converging into the final fit 
displayed with the continuous red line. Fig.^ displays the effect of 
modifying the asymmetry coefficient f . The ordinary least-squares 
fit corresponds to £ = 1 (dashed blue line). The asymmetric fits are 
obtained for £ > 1. The figure illustrates how for £ = 10 and 100 
the resulting upper boundaries do still leave points in the wrong side 
of the boundary. Only when £ = 1000 (continuous red line) is the 
boundary fit appropriate. Thus, a proper boundary fitting requires 
the asymmetry coefficient to be large enough to compensate for the 
pulling effect of the points that are in the inner side of the bound- 
ary. On the other hand, Fig. [2J; shows the impact of changing the 
power a in Eq. For the lowest value, a — 1 (dotted blue line), 
the fit is practically identical to the one obtained with a — 2 (con- 
tinuous red line). For the largest values, a = 3 or 5 (dotted green 
and dashed orange lines), the boundaries are below the expected 
location, leaving some points outside (above) the fits. In these last 
cases the power a is too high and, for that reason, the distance from 
the boundary to the more distant points in the inner side have a too 
high effect in the cost function given by Eq. (2). 

Another important aspect to take into account when using 
a numerical method is the convergence of the fitted coefficients. 
Fig. [3] displays, for the same example just described in Fig. [2j>, 
the values of the 6 fitted polynomial coefficients as a function of 
the maximum number of iterations allowed. The figure includes 
the results for £ = 10, 100 and 1000 (using a = 2 and /3 = in 
the three cases). In overall, the convergence is reached faster when 
£ = 1000. Fig. [2^ already showed that for this particular value of 
the asymmetry coefficient a quite reasonable fit is already achieved 
when iV m axiter=31. Beyond this maximum number of iterations the 
coefficients only change slightly, until they definitely settle around 
AUxte ~ 140. 

Although simple polynomials can be excellent functional 
forms for a boundary determination (as shown in the previous ex- 
ample), when the data to be fitted exhibit rapidly changing values, 
a single polynomial is not always able to reproduce the observed 
trend. A powerful alternative in these situations consists in the use 
of splines. The next section presents an improved method that us- 
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Figure 2. Panel (a): Example of upper boundary fitting using a 5th order 
polynomial. The initial data set correspond to 100 points randomly drawn 
from the function y(x) = l/x, assuming the uncertainty a = 10 for all the 
points in the y-axis. The dashed blue line is the ordinary least-squares fit to 
that data, used as the initial guess for the numerical determination of the 
boundary. Since all the points have the same uncertainty, there is no need 
for an error-weighted procedure. For that reason /3 = has been used in 
Eq. (3)- In addition a = 2 and an asymmetry coefficient £ = 1000 were 
employed. The grey lines indicate the boundary fits obtained for -/V max i ler in 
the range from 5 to 2000 iterations, at arbitrary steps. The inset displays a 
zoomed plot region where some particular values of -/V max i lcr are annotated 
over the corresponding fits. The continuous red line is the final boundary de- 
termination obtained using ./V max j ti , r = 2000. Panel (b): Effect of employ- 
ing different asymmetry coefficients £ for the upper boundary fit shown 
in panel (a). In the four cases the same maximum number of iterations 
(Amaxiter = 2000) has been employed, with a = 2. Panel (c): Effect of us- 
ing different values of the power a, with Af raax i tC i = 2000 and £ = 1000. 
See discussion in Section l2~3l 



ing classic cubic splines, but introducing additional degrees of free- 
dom, offers a much larger flexibility for boundary fitting. 
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Figure 3. Variation in the fitted coefficients, as a function of the num- 
ber of iterations, for the upper boundary fit (5th order polynomial 
y(x) = Jj_ ai x ^ snown m Fig- lit- Each panel represents the coef- 
ficient value at a given iteration (Oj, with i = 0, . . . , 5, from bottom to top) 
divided by a*, the final value derived after N milxlta = 2000 iterations. The 
same y-axis range is employed in all the plots. Red lines correspond to an 
asymmetry coefficient £ = 1000, whereas the blue and green grey lines in- 
dicate the coefficients obtained with § = 10 and § = 100, respectively (in 
all the cases a = 2 and /3 = have been employed). Note that the plot 
x-scale is in logarithmic units. 

3 ADAPTIVE SPLINES 

3.1 Using splines with adaptable knot location 

Splines are commonly employed for interpolation and modelling of 
arbitrary functions. Many times they are preferred to simple poly- 
nomials due to their flexibility. A spline is a piecewise polynomial 
function that is locally very simple, typically third-order polynomi- 
als (the so called cubic splines). These local polynomials are forced 
to pass through a prefixed number of points, A^nots, which we will 
refer as knots. In this way, the functional form of a fit to splines can 
be expressed as 

y(x) = s 3 (k)[x - a; kno t(fc)] 3 + s 2 (k)[x - x knot (k)] 2 + 
+ s 1 (k)[x - x kno t(k)] + s (k), 

where (x knot (k), j/knot(fe)) are the {x,y) coordinates of the 
fc th knot, and sn(fc), si(fe), S2(fc), and S3(k) are the corre- 
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sponding spline coefficients for x £ [xi_ not (k) , x^ not (k + 1)], with 
k = 1, . . . , Aknots — 1. These coefficients are easily computable by 
imposing the set of splines to define a continuous function and that, 
in addition, not only the function, but also the first and second 
derivatives match at the knots (two additional conditions are re- 
quired; typically they are provided by assuming the second deriva- 
tives at the two endpoints to be zero, leading to what are normally 
referred as natural splines). The c omputation of splines is w idely 
described in the literature (see e.g. [Gerald & Wheatle yl 19891) . 

The final result of a fit to splines will strongly depend on both, 
the number and the precise locat ion of the knot s. With the aim of 
having more flexibility in the fits. lCardieil |l999) explored the pos- 
sibility of setting the location of the knots as free parameters, in 
order to determine the optimal coordinates of these knots that im- 
prove the overall fit of the data. The solution to the problem can 
be derived numerically using any minimisation algorithm, as the 
DOWNHILL simplex method previously described. In this way the 
set of splines smoothly adapts to the data. The same approach can 
be applied to the data boundary fitting, using as functional form for 
the function y(x) in Eq. l(2j the adaptive splines just described. It 
is important to highlight that in this case the optimal boundary fit 
requires not only to find the appropriate coefficients of the splines, 
but also the optimal location of the knots. 

3.2 The fitting procedure 

In order to carry out the double optimisation process (for the coef- 
ficients and the knots location) required to compute a boundary fit 
using adaptive splines, the following steps can be followed: 

(i) Fix the initial number of knots to be employed, Aknots- Us- 
ing a large value provides more flexibility, although the number of 
parameters to be determined logically scales with this number, and 
the numerical optimisation demands a larger computational effort. 

(ii) Obtain an initial solution with fixed knot locations. For this 
purpose it is sufficient, for example, to start by dividing the full 
a; -range to be fitted by (Aknots — 1). This leads to a regular distri- 
bution of equidistant knots. The initial fit is then derived by min- 
imising the cost function given in Eq. {2]l, leaving as free parame- 
ters the y-coordinates of all the knots simultaneously, while keep- 
ing fixed the corresponding ^-coordinates. This numerical fit also 
requires a preliminary guess solution, than can be easily obtained 
through (TVknots — 1) independent ordinary least-squares fit of the 
data placed between each consecutive pair of knots, using for this 
purpose simple polynomials of degree 1 or 2. In this guess solution 
the y-coordinate for each knot is then evaluated as the average value 
for the two neighbouring preliminary polynomial fits (only one for 
the knots at the borders of the x -range). Obviously, if there is ad- 
ditional information concerning a more suitable knot arrangement 
than the equidistant pattern, it must be used to start the process with 
an even better initial solution which will facilitate a faster conver- 
gence to the final solution. 

(iii) Refine the fit. Once some initial spline coefficients have 
been determined, the fit is refined by setting as free parameters the 
location of all the inner knots, both in the x- and y-directions. The 
outer knots (the first and last in the ordered sequence) are only al- 
lowed to be refined in the y-axis direction with the aim of preserv- 
ing the initial x-range coverage. The simultaneous minimisation of 
the x and y coordinates of all the knots at once will imply find- 
ing the minimum of a multidimensional function with too many 
variables. This is normally something very difficult, with no guar- 
antee of a fast convergence. The problem reveals to be treatable just 



by solving for the optimised coordinates of every single knot sep- 
arately. In practice, a refinement can be defined as the process of 
refining the location of all the Aknots knots, one at a time, where the 
order in which a given knot is optimised is randomly determined. 
Each knot optimisation requires, in turn, a value for the maximum 
number of iterations allowed iVmaxiter- Thus, at the end of every sin- 
gle refinement process all the knots have been refined once. An 
extra penalisation can be introduced in the cost function with the 
idea of avoiding that knots exchange their order in the list of or- 
dered sequence of knots. This inclusion typically implies that, if 
Aimots is large, several knots end up colliding and having the same 
coordinates.The whole process can be repeated by indicating the 
total number of refinement processes, An-fine- 

(iv) Optimise the number of knots. If after N K n m refinement pro- 
cesses several knots have collided and exhibit the same coordinates, 
this is an evidence that Akn ts was probably too large. In this case, 
those colliding knots can be merged and the effective number of 
knots be accordingly reduced. If, on the contrary, the knots being 
used do not collide, it is interesting to check whether a higher Aknots 
can be employed. With the new Aknots, step (iii) is repeated again. 

Although at first sight it may seem excessive to use a large 
number of knots when some of them are going to end up colliding, 
these collisions will typically take place at optimised locations for 
the considered fit. As far as the minimisation algorithm is able to 
handle such large Aknots, it is not such a bad idea to start using an 
overestimated number and merge the colliding knots as the refine- 
ment processes take place. 

The fitting algorithm can be halted once a satisfactory fit is 
found at the end of step (iii). By satisfactory one can accept a fit 
which coefficients do not significantly change by increasing neither 
ATreflne nor Nmsuitet, and in which there are no colliding knots. 

3.3 Example: boundary fitting to adaptive splines and 
comparison with simple polynomials 

To illustrate the flexibility of adaptive splines, Fig. [4^ displays 
the corresponding upper boundary fit employing the same example 
data displayed in Fig. [2] for the case Aknots = 15. The preliminary 
fit (shown as a dotted blue line) was computed by placing the Aknots 
equidistantly spread in the x-axis range exhibited by the data, and 
performing (Aknots — 1) independent ordinary least-squares fit of 
the data placed between each consecutive pair of knots, using 2nd 
order polynomials, as explained in step (ii). Although unavoidably 
this preliminary fit is far from the final result (due to the fact that 
this is just the merging of several independent ordinary fits through 
data exhibiting large scatter and that the x -range between adjacent 
knots is not large), after Amaxiter iterations without any refinement 
(i.e., without modifying the initial equidistant knot pattern) the al- 
gorithm provides the fit shown as the dashed green line. The light 
grey lines display the resulting fits obtained by allowing the knot 
locations to vary, and after 40 refinements one gets the boundary 
fit represented by the continuous red line. Since the knot location 
has a large influence in the quality of the boundary determination, 
very high values for Amaxiter are not required (typically values for 
the number of iterations needed to obtain refined knot coordinates 
are ~ 100). Analogously to what was done with the simple poly- 
nomial fit, in Fig. |4p and |4j; the effects of varying the asymme- 
try coefficient £ and the power a are also examined. In the case 
of £, it is again clear that the highest value (£ = 1000) leads to a 
tighter fit. Concerning the power a, the best result is obtained when 
distances are considered quadratically, i.e. a — 2. For the largest 
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Figure 4. Example of the use of adaptive splines to compute the up- 
per boundary of the same sample data displayed in Fig. [2] In this case 
^icnots = 15 has been employed. Panel (a): the preliminary fit (dotted blue 
line) shows the initial guess determined from (N^ nots — 1) independent or- 
dinary least-squares fit of the data, as explained in Section |3~3l By impos- 
ing Afmaxiter = 1000 the fit improves, although in most cases the effective 
iVmaxiter is much lower since the algorithm computes spline coefficients 
that have converged before the number of iterations reaches that maximum 
value. The dashed green line shows the first fit obtained with still the knots 
at their initial equidistant locations. Successive refinements (light grey) al- 
low the knots to change their positions, which leads to the final boundary 
determination (continuous red line, corresponding to iV re fine = 40). In all 
these fits £ = 1000, a = 2 and = have been employed. Panel (b): 
Effect of using different asymmetry coefficients £ for the upper bound- 
ary fit shown in the previous panel. In the four cases Af max ; ter = 1000, 
-'Vrefine = 40, a = 2 and /3 = were used. Panel (c): Effect of employing 
different values of the power a, with £ = 1000, N rt fi ne = 40 and /3 = 0. 
See discussion in Section l3~3l 



values, a = 3 and 5, the resulting boundaries leave points above 
the fits. The case a = 1 is not very different to the quadratic fit, 
although in some regions (e.g. x £ [0.01,0.04]) the boundary is 
probably too high. In addition, Fig. [5] displays the variation in the 
location of the knots as iV r cfine increases, for the final fit displayed 
in Fig. [4^. The initial equidistant pattern (open blue circles; cor- 
responding to TVrefine = 0) is modified as each individual knot is 
allowed to change its coordinates. It is clear that some of the knots 
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Figure 5. Variation in the location of the knots corresponding to the upper 
boundary fitting to adaptive splines displayed in Fig. [4^. Before introducing 
any refinement (Af re fl n c = 0), the 15 knots were regularly placed, as shown 
with the open blue circles. In each refinement process the inner knots are 
allowed to modify its location, one at a time. The first and last knots are 
fixed in order to preserve the fitted rr-range. The final knot locations after 
^refine = 40 are shown with the filled red triangles. 
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Figure 6. Comparison between different functional forms for the bound- 
ary fitting. The sample data set corresponds to the same values employed 
in Figs.|2]and|4] The boundaries have been determined using simple poly- 
nomials of 5th degree (continuous blue lines) and adaptive splines (dotted 
red lines; A^ots = 15 and N rs fi nc = 40), following the steps given in Sec- 
tions l2.3l and l3.3l respectively. The shaded area is simply the diagram region 
comprised between both adaptive splines boundaries. As expected, adaptive 
splines are more flexible, providing tighter boundaries than simple polyno- 
mials. 



approximate and could be, in principle, merged into single knots, 
revealing that the initial number of knots was overestimated. 

Finally Fig.[6]presents, for the same sample data employed in 
Figs.|2]and[4] the comparison between the boundary fits to simple 
polynomials (continuous blue lines) and to adaptive splines (dot- 
ted red lines). The shaded area corresponds to the diagram region 
comprised between the two adaptive splines boundaries. In this fig- 
ure both the upper and the lower boundary limits, computed as de- 
scribed previously, are represented. It is clear from this graphical 
comparison that the larger number of degrees of freedom intro- 
duced with adaptive splines allows a much tighter boundary de- 
termination. The answer to the immediate question of which fit 
(simple polynomials or splines) is more appropriate will obviously 
depend on the nature of the considered problem. 
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4 PRACTICAL APPLICATIONS 

4.1 Estimation of spectra pseudo-continuum 

As mention in Section [TJ a typical situation in which the compu- 
tation of a boundary can be useful is in the estimation of spec- 
tra pseudo-continuum. The strengths of spectral features have been 
measured in different ways so far. However, although with slight 
differences among them, most authors have employed line-strength 
indices with definitions close to the classical expression for an 
equivalent width 



EW(A) 



line 



(1 - S(A)/C(A)dA, 



(6) 



where S{\) is the observed spectrum and C(A) is the local 
continuum, usually obtained by i nterpolation of S(\) between 
two adjacent spectral r egions (e.g. lFaberlfl973l : I Faber et al ]l 19771: 
IWMtford&Rkhlll983h. In practice, as pointed out by iGeislej 
dl984l) (see also lRichl 1988) at low and intermediate spectral res- 
olution the local continuum is unavoidably lost, and a pseudo- 
continuum is measured instead of a true continuum. The upper 
boundary fitting, either by using simple polynomials or adaptive 
splines, constitutes an excellent option for the estimation of that 
pseudo-continuum. To illustrate this statement, several examples 
are presented and discussed in this section. In all these examples, 
the boundary fits have been computed ignoring data uncertainties, 
i.e., assuming f3 = in Eq. (5J. The impact of errors is this type of 
application is discussed later, in Section|5] 

Fig. |7j displays upper boundary fits for the particular stel- 
lar spectrum of HD003651 b elonging to the MILES^I library 
( Sanchez-Blazquez et al. 2006). The results using simple polyno- 
mials and adaptive splines with different tunable parameters are 
shown. Panels [TJt and [7J5 show the results derived using simple 
5th-order polynomials, whereas panels [7J; and [TJi display the fits 
obtained employing adaptive splines with Aknots = 5. The impact 
of modifying the asymmetry coefficient £ is explored in panels [7^ 
and[7J: (in these fits, a — 2 and N mw uat = 1000 have been used; the 
adaptive splines fits were refined A re fi n e = 10 times). The dashed 
blue lines indicate the ordinary least-squares fits, i.e., those ob- 
tained when there is no effective asymmetry (£ = 1), which in each 
case was used as the initial guess fit in the numerical minimisa- 
tion process. For relatively low values of the asymmetry coefficient 
(£ = 10 or 100) the fits are not as good as when using the largest 
value (£ = 1000). This is easy to understand, since the relatively 
large number of points to be fitted in this example (A — 3847), 
requires that the points that still fall in the outer side of the bound- 
ary during the numerical minimisation of Eq. {2]l overcome the 
pulling effect of the points in the inner side of the boundary. On 
the other hand, panels [7p and[7Ji display the effect of changing the 
power a in the fits. Again, the dashed blue lines correspond to the 
ordinary least-squares fits (in the rest of the cases £ = 1000 and 
A m axiter = 1000 have been used; the adaptive splines fits were re- 
fined Arefine = 10 times). In these cases, the best boundary fits are 
obtained for a = 1, whereas for the larger values the fits depart 
from the expected result. 

The above example illustrates that the optimal asymmetry co- 
efficient £ and power a during the boundary procedure can (and 
must) be tuned for the particular problem under study. Not sur- 
prisingly, this fact also concerns the number of knots when using 
adaptive splines. Fig. [8] shows the different results obtained when 



See http : //www .ucm.es/info/ As trof/miles/ 
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Figure 8. Examples of pseudo-continuum fits obtained using adaptive 
splines with different number of knots. The same stellar spectrum displayed 
in Fig.|7]is employed here. The dashed blue line indicates the ordinary least- 
squares fit of the data (f = 1, a = 2). In the rest of the fits, £ = 1000, 
a = 1 and -/V ren „ e = 20 have been used. The effect of using a different 
value of ./Vknots is clearly visible. See discussion in Section PTTl 



estimating the pseudo-continuum in the same stellar spectrum pre- 
viously considered, employing different values of Aknots. As ex- 
pected, the fit adapts to the irregularities exhibited by the spectrum 
as the number of knots increases. This is something that for some 
purposes may not be desired. For instance, the fits obtained with 
Aknots = 12, and more notably with Ak„ot s = 16, detect the absorp- 
tion around the Mg I feature at A ~ 5200 A, and for this reason 
these fits underestimate the total absorption produced at this wave- 
length region. In situations like this the boundary obtained with a 
lower number of knots may be more suitable. Obviously there is 
no general rule to define the right Aknots, since the most convenient 
value will depend on the nature of the problem under study. 

In order to obtain a quantitative determination of the impact 
of using the upper boundary fit instead in the estimation of lo- 
cal pseudo-continuum, Fig.[9]compares the actual line-strength in- 
dices derived for three Balmer lines (H/3, H7 and H<5, from right 
to left) using three different strategies. For this particular exam- 
ple the same stellar spectrum displayed in Fig. [7] has been used. 
Overplotted on each spectrum are the bandpasses typically used 
for the measurement of these spectroscopic features. In pa rticular, 
de ban dpasses limits for Hf3 are the revised values given by Trager 
( 1997), whereas for H7 and H£ the limit s corre spond to H7.F and 
H<5f, as defined bv lWorthev & Ottavianil dl997t) . For each feature, 
the corresponding line-strength has been computed by determining 
the pseudo-continuum using: i) the straight line joining the mean 
fluxes in the blue and red bandpasses (top panels) which is the tra- 
ditional method; ii) the straight line joining the values of the upper 
boundary fits evaluated at the centres of the same bandpasses (cen- 
tral panels); and iii) the upper boundary fits themselves (bottom 
panels). For the cases ii) and iii) the upper boundary fits have been 
derived using a second order polynomial fitted to the three band- 
passes. The resulting line-strength indices, numerically displayed 
above each spectrum, have been computed as the area comprised 
between the adopted pseudo-continuum fit and the stellar spectrum 
within the central bandpass. For the three Balmer lines it is clear 
that the use of the boundary fit provides larger indices. The tra- 
ditional method provides very bad values for H7 and H<5 (which 
are even negative!), given that the pseudo-continuum is very seri- 
ously affected by the absorption features in the continuum band- 
passes. This is a well-known problem that has led many authors 
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Figure 7. Examples of pseudo-continuum fits derived using upper boundaries with different tunable parameters. Panels (a) and (b) correspond to simple 5th 
order polynomials, w hereas adaptive splines have bee n employed in panels (c) and (d). The stellar spectrum corresponds to the K0V star HD003651 belonging 
to the MILES library I Sanchez-Blaz quez et alj200q) . In the four panels the dashed blue line indicates the ordinary least-squares fit of the data. See discussion 
in Section ETTl 



to seek for alternative ban dpass definitions (see e.g. Rose] 1 19941 ; 
Vazdekis & Arimotcll 19991) which, on the other hand, are not im- 
mune to other problems related to their sensitivity to spectral res- 
olution and their high signal-to-noise requirements. These are very 
important issues that deserve a much careful analysis, that is be- 
yond the aim of this paper, and they are going to be studied in a 
forthcoming work (Cardiel 2009, in preparation). 

The results of Fig. [7] reveal that, for the wavelength inter- 
val considered in that example, the boundary determinations ob- 
tained by using polynomials and adaptive splines are not very dif- 
ferent. However, it is expected that as the wavelength range in- 
creases and the expected pseudo-continuum becomes more com- 
plex, the larger flexibility of adaptive splines in comparison with 
simple polynomials should provide better fits. To explore this flex- 
ibility in more detail, Fig. [10] shows the result of using adaptive 
splines to estimate the pseudo-continuum of 12 different spectra 
corresponding to stars exhibiting a wide range of spectral types 
(from B 5V to M5V), selected from t he empirical stellar library 
MILES JSanchez-Blazquez et alJkoOfj) previously mentioned. Al- 
though in all the cases the fits have been computed blindly with- 
out considering the use of an initial knot arrangement appropriate 
for the particularities of each spectral type, it is clear from the fig- 
ure that adaptive splines are flexible enough to give reasonable fits 
independently of the considered star. More refined fits can be ob- 
tained using an initial knot pattern more adjusted to the curvature 
of the pseudo-continuum exhibit by the stellar spectra. 



A good estimation of spectra pseudo-continuum is very useful, 
for example, when correcting spectroscopic data from telluric ab- 
sorptions using featureless (or almost featureless) calibration spec- 
tra. This is a common strategy when performing observations in 
the near-infrared windows. Fig. II lb illustrates a typical example, 
in which the observation of the hot star V986 Oph (HD165174, 
spectral type B0III) is employed to determine the correction. This 
star was observed in the J ban d as part of the calib ration work 
of the observations presented in ICardiel et alj d2003h - The stellar 
spectrum is shown in light grey, whereas the blue points indicate 
a manual selection of spectrum regions employed to estimate the 
overall pseudo-continuum. The dotted green line corresponds to the 
ordinary least-squares fit of these points, whereas the red continu- 
ous line is the upper boundary obtained with adaptive splines using 
ATknots = 3 with an asymmetry coefficient £ == 10000. In Fig. 1 1 lb 
the ratio between both fits in represented, showing that there are dif- 
ferences up to a few percent between these fits. Two kind of errors 
are present here. In overall the ordinary least-squares fit underesti- 
mates the pseudo-continuum level, which introduces a systematic 
bias on the resulting depth of the telluric features (the whole curve 
displayed in Fig. I lib is above 1.0). In addition, since the selected 
blue points do include real (although small) spectroscopic features, 
there are variations as a function of wavelength of the above dis- 
crepancy. These differences can be important when trying to per- 
form a high-quality spectrophotometric calibration. It is important 
to highlight that an important additional advantage of the bound- 
ary fitting is that this method does not require the masking of any 
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Figure 9. Comparison of different strategies in the computation of the pseudo-continuum for the measurement of line-strength indices. The same stellar 
spectrum displayed in Fig.|7]is employed here. In this example three Balmer features are analised, namely H<5, H7 and H/3 (from left to right), showing the 
commonly employed blue, central and red sidebands used in their measurement. Top panels correspond to the traditional method in stellar population studies, 
in which the pseudo-continuum is computed as the straight line joining the mean fluxes in the blue and red sidebands, respectively. In the middle panels the 
pseudo-continua have been computed as the straight line joining the values of the upper boundary fits (second order polynomials fitted to the three bandpasses; 
dotted lines), evaluated at the centres of the blue and red bandpasses. Finally, in the bottom panels the pseudo-continua are not computed as straight lines, 
but as the upper boundary fits themselves. In each case the resulting line-strength value (area comprised between the pseudo-continuum fit and the stellar 
spectrum) is shown. See discussion in Section PPI 



region of the problem spectrum, which avoids the effort (and the 
subjectivity) of selecting special points to guide the fit. 

Another important aspect concerning the use of boundary fits 
for the determination of the pseudo-continuum of spectra is that this 
method can provide an alternative approach for the estimation of 
the pseudo-continuum flux when measuring line-strength indices. 
Instead of using the average fluxes in bandpasses located nearby 
the (typically central) bandpass covering the relevant spectroscopic 
feature, the mean flux on the upper boundary can be employed. In 
this case it is important to take into account that flux uncertainties 
will bias the fits towards higher values. Under these situations the 
approach described later in Section[5]can be employed. Concerning 
this problem is w orth mentioning here the method presented by 
iRogers et alj ( feOOijf) . who employ a boosted median continuum to 
derive equivalent widths more robustly than using the classic side- 
band procedure. 



previously used in Figs. [2] [4] and [6] Once the lower and the up- 
per boundaries are available, it is trivial to generate a grid of lines 
dividing the region comprised between the boundaries as needed. 

A more complex scenario is that in which the data exhibit a 
clear scatter around some tendency, and one needs to determine 
regions including a given fraction of the points. A frequent case ap- 
pears when one needs to remove outliers, and then it is necessary 
to obtain an estimation of the regions containing some relevant per- 
centages of the data. In Fig. 112b this situation is exemplified with 
the use of a simulated data set consisting in 30000 points, for which 
the regions that include 68.27% and 95.44% of the centred data 
points, corresponding to ±1<t and ±2<r in a normal distribution, 
have been determined by first selecting those data subsets, and then 
fitting their corresponding boundaries using adaptive splines, as ex- 
plained with more detail in the figure caption. 



4.2 Estimation of data ranges 

A quite trivial but useful application of the boundary fits is the em- 
pirical determination of data ranges. One can consider scenarios 
in which it is needed to subdivide the region spanned by the data 
in a particular grid. Fig. [T2b illustrates this situation, making use 
of the 5th order polynomial boundaries corresponding to the data 



5 THE IMPACT OF DATA UNCERTAINTIES 

Although the method described in Section [2] already takes into ac- 
count data uncertainties through their inclusion as a weighting pa- 
rameter (governed by the exponent /3), it is important to highlight 
that this weighting scheme does not prevent the boundary fits to 
be highly biased due to the presence of such uncertainties. For ex- 
ample, in the determination of the pseudo-continuum of a given 
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Figure 10. Examples of pseudo-continuum fits using adaptive splines. Several stars from the stellar library MILES (Sanchez-Blazquez et al. 2006), spanning 
different spectral types, have been selected. The fitted pseudo-continua (continuous black line) have been automatically determined employing N^ ois = 19, 

iVmariter = 1000, define = 20, £, = 1000, Q = 2 and f3 = 0. 



spectrum, even considering the same error bars for the fluxes at 
all wavelengths, the presence of noise unavoidably produces some 
scatter around the real data. When fitting the upper boundary to 
a noisy spectrum the fit will be dominated by the points that ran- 
domly exhibit the largest positive departures. Under these circum- 
stances, two different alternatives can be devised: 

(i) To perform a previous rebinning or filtering of the data prior 
to the boundary fitting, in order to eliminate, or at least minimize, 
the impact of data uncertainties. After the filtering one assumes 
that these uncertainties are not seriously biasing the boundary fit. 
In this way one can employ the same technique described in Sec- 
tion^ This approach is illustrated in Fig.|13h. In this case the orig- 
inal spectrum of HD00365 (also employed in Figs. [7] and [8}, as 
extracted from the MILES library (Sanchez-Blaz quez et alj |2006), 
is considered as a noise-free spectrum (plotted in blue). Its corre- 
sponding upper boundary fit using adaptive splines with iVknots = 5 
is shown as the cyan line. This original spectrum has been artifi- 
cially degraded by considering an arbitrary signal-to-noise ratio per 
pixel S/N=10 (displayed in green), and the resulting upper bound- 
ary fit is shown with a dashed green line. It is obvious that this 
last fit is highly biased, being dominated by the points with higher 
fluxes. Finally, the noisy spectrum has been filtered by convolving 
it with a Gaussian kernel (of standard deviation 100 km/s), with 
the result being over-plotted in red. Note that this filtered spectrum 
overlaps almost exactly with the original spectrum. The boundary 
fit plotted with the continuous orange line is the upper boundary 



for that filtered spectrum. Although the result is not the same as the 
one derived with the original spectrum, it is much better than the 
one directly obtained over the noisy spectrum. 

(ii) To allow a loose boundary fitting. Another possibility con- 
sists in trying to leave a fraction of the points with extreme values to 
fall outside (i.e., in the wrong side) of the boundary, specially those 
with higher uncertainties. This option is easy to parametrize by in- 
troducing a cut-off parameter r into the overall weighting factors 
given in Eq. {3). The new factors can then be computed as 



(7) 



where Oi is the uncertainty associated to the dependent variable j/j . 
The cut-off parameter assigns to a point that falls outside of the 
boundary by distance that is less than or equal to ro~i the same low 
weight during the fitting procedure than the weight that receive the 
inner points. In other words, points like that do not receive the ex- 
tra weighting factor provided by the asymmetry coefficient even 
though they are outside of the boundary. Note that r = simplifies 
the algorithm to the one described in Section|2] Fig. 1 13b illustrates 
the use of the cut-off parameter r in the upper boundary fitting 
of the spectrum of HD003651. The cyan boundary is again the up- 
per boundary determination using adaptive splines with the original 
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Figure 11. Comparison of the results of using an ordinary fit and adaptive 
splines when deriving the telluric correction in a particular spectroscopic 
calibration. Panel (a): the light grey line corresponds to the spectrum ob- 
tained in the J band of the hot star HD 165 174. Some special points of 
this spectrum have been manually selected (small blue points) to determine 
the approximate pseudo-continuum. The resulting ordinary fit to adaptive 
splines (i.e. adopting £ = 1) using exclusively these selected points is dis- 
played with the dotted green line. A more suitable fit (continuous red line) 
is obtained employing £ = f 0000, in which case the fit is performed over 
the whole spectrum. The two fits have been carried out with N^ais = 3, 
Amaxiter = 1000, ./V re fi ne = 10, a = 2 and ,3 = 0, Panel (by. ratio between 
the two fits displayed in the previous panel. 

spectrum. The rest of the boundary fits correspond to the use of the 
weighting scheme given in Eq. ((7) for different values of r, as in- 
dicated in the legend. As r increases, a larger number of points are 
left outside of the boundary during the minimisation procedure. In 
the example, the value r = 3 seems to give a reasonable fit in the 
redder part of the spectrum, although in the bluer region the corre- 
sponding fit is too low. It is clear from this example that to define a 
correct value of r is not a trivial issue. Most of the times the most 
suited r will be a compromise between a high value (in order to 
avoid the bias introduced by highly deviant points) and a low value 
(in order to avoid leaving outside of the boundary right data points). 

An additional complication arises when one combines in the 
same data set points with different uncertainties. It is in these sit- 
uations when the role of the power j3 in Eq. l(2j becomes impor- 
tant. To illustrate the situation, Fig.ll4lshows the different pseudo- 
continuum estimations obtained again for the star HD003651, but 
now considering that the spectrum is much noisier below 4200 A 
than above this wavelength. In panel [T4"h the fits are derived ig- 
noring the cut-off parameter previously discussed (i.e. assuming 
r = 0), but with different values of (5. In the unweighted case 
(J3 — 0, dashed green line) the resulting upper boundary is dramat- 
ically biased for A < 4200 A due to the presence of highly deviant 
fluxes. The use of non-null (and positive) values of f3 induces the 
fit to be less dependent on the noisier values, being necessary a 
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Figure 12. Examples of data boundary applications for the estimation of 
data ranges. Panel (a): Using the lower and upper boundary limits for the 
data displayed in Figs. [2] [4] and [6] and computed using simple 5th order 
polynomials, it is trivial to subdivide the range spanned by the data in the 
j/-axis by creating a regular grid (i.e. contant Ay at a fixed x) between both 
boundary limits. In this example the region has been subdivided in ten in- 
tervals. Panel (b): 30000 points randomly drawn from the functional form 
y = 1/x, with cr — 10 for all the points. Splitting the x-range in 100 inter- 
vals, sorting the data within each interval and keeping track of the subsets 
containing 68.27% (±lcr; blue points) and 95.44% (±2<r; green points) of 
the data points around the median, it is possible to compute the upper and 
lower boundaries for those two subsets (continuous red and orange lines, 
respectively). The boundaries in this example have been determined using 
adaptive splines with N^ ots = 15, iVitetmax — 1000, Nteine = 10, a = 2, 
and 13 = 0. 

value as high as j3 — 3 to obtain a fit similar to the one obtained 
in absence of noise (cyan line). However, since the fitted spectrum 
(green) do still have noise for A > 4200 A, all the fits in that re- 
gion are still biased compared to the fit for the original spectrum 
(cyan). In order to deal not only with the variable noise, but with 
the noise itself independently of its absolute value, it is possible to 
combine the effect of a tuned /3 value with the introduction of a 
cut-off parameter r. Fig. 1 14b shows the results derived employing 
a fixed value r = 2 with the same variable values of f3 used in the 
previous panel. In this case, the boundary corresponding to f3 — 2 
(magenta) exhibits an excellent agreement with the fit for the orig- 
inal spectrum (cyan) at all wavelengths. Thus, the combined effect 
of an error-weighted fit and the use of a cut-off parameter is provid- 
ing a reasonable boundary determination, even under the presence 
of wavelength dependent noise. 



6 CONCLUSIONS 

This work has confronted the problem of obtaining analytical ex- 
pressions for the upper and lower boundaries of a given data set. 
The task reveals treatable using a generalised version of the very 
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Figure 13. Comparison of the two approaches described in Section \5\ for 
the boundary fitting with data uncertainties. Panel (a): original spectrum of 
HD00365 1 without noise (blue spectrum), spectrum with artificially added 
noise (green spectrum) and noisy spectrum after a Gaussian filtering (red 
spectrum). Note that the original (blue) and the filtered noisy (red) spectra 
are almost coincident. The upper boundary displayed with a dashed green 
line is the fit to the noisy spectrum using adaptive splines, whereas the up- 
per boundaries plotted with continuous orange and cyan lines are the fits 
to the filtered noisy spectrum and to the original spectrum, respectively. 
Panel (b): original and noisy spectra are plotted with blue and green lines, 
respectively (the filtered spectrum is not plotted here). The cyan line is again 
the fit to the original spectrum. The rest of the boundary lines indicate the 
fits to the noisy spectrum using different values of the cut-off parameter 
(red r = 1, orange r = 2, and green r = 3). In all the fits = 5, 

iVmariter = 1000, £ = 1000, a = 1, f3 = 0, and Ar refinc = 10 have been 
employed. See discussion in Section|5] 



well-known ordinary least-squares fit method. The key ideas be- 
hind the proposed method can be summarised as follows: 

• The sought boundary is iteratively determined starting from 
an initial guess fit. For the analysed cases an ordinary least-squares 
fit provides a suitable starting point. At every iteration in the pro- 
cedure a particular fit is always available. 

• In each iteration the data to be fitted are segregated in two 
subgroups depending on their position relative to the particular fit 
at that iteration. In this sense, points are classified as being inside 
or outside of the boundary. 

• Points located outside of the boundary are given an ex- 
tra weight in the cost function to be minimized. This weight is 
parametrized through the asymmetry coefficient £. The net effect of 
this coefficient is to generate a stronger pulling effect of the outer 
points over the fit, which in this way shifts towards the frontier de- 
lineated by the outer points as the iterations proceed. 

• The distance from the points to a given fit are introduced in 
the cost function with a variable power a, not necessarily in the 
traditional squared way. This supplies an additional parameter to 
play with when performing the boundary determination. 
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Figure 14. Study of the impact of variable signal-to-noise ratio in the upper 
boundary fitting of the spectrum of the star HD00365 1 . In both panels the 
original spectrum (blue) is plotted together with the same spectrum after 
artificially adding noise (green) corresponding to a signal-to-noise ratio per 
pixel S/N=3 for A < 4200 A, and to S/N=50 for A > 4200 A. The cyan 
line indicates the upper boundary fit to the original spectrum. Panel {a): 
In these fits the cut-off parameter has been ignored (r = 0), but different 
values of the power /3, as indicated in the legend, are employed. Note that 
the unweighted fit (J3 = 0; dashed green line) is highly biased. Panel (b): 
the same fits of the previous panel are repeated here but using r = 2. In all 
the fits Af knots = 5, N maxlta = 1000, £ = 1000, a = 1, and jV reftoe = 10 
have been employed. See discussion in Section[5] 

• Since data uncertainties are responsible for the existence of 
highly deviant points in the considered data sets, their incorpora- 
tion in the boundary determination has been considered in two dif- 
ferent and complementary ways. Errors can readily be incorporated 
into the cost function as weighting factors with a variable power ft 
(which does not have to be necessarily two). In addition, a cutt-off 
parameter r can also be tuned to exclude outer points from receiv- 
ing the extra factor given by the asymmetry coefficient depending 
on the absolute value of their error bar. The use of both parame- 
ters ([3 and r) provides enough flexibility to handle the role of the 
data uncertainties in different ways depending on the nature of the 
considered boundary problem. 

• The minimisation of the cost function can be easily carried 
out using the popular DOWNHILL simplex method. This allows the 
use of any computable function as the analytical expression for the 
boundary fits. 

The described fitting method has been illustrated with the use 
of simple polynomials, which probably are enough for most com- 
mon situations. For those scenarios where the data exhibit rapidly 
changing values, a more powerful approach, using adaptive splines, 
has also been described. Examples using both simple polynomials 
and adaptive splines have been presented, showing that they are 
good alternatives to estimate the pseudo-continuum of spectra and 
to segregate data in ranges. 
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The analysed examples have shown that there is no magic rule 
to a priori establish the most suitable values for the tunable parame- 
ters (£, a, f3, t, iVmaxiter, A^knots). The most appropriate choices must 
be accordingly tuned for the particular problem under study. In any 
case, typical values for some of these parameters in the considered 
examples are £ G [1000, 10000] and a G [1, 2]. Unweighted fits re- 
quire /3 = 0. To take into account data uncertainties one must play 
around with the /3 and r parameters (which typical values range 
from to 3). 

A new program called BoundFit (and available at the URL 
given in SectionQ} has been written by the author to help any per- 
son interested in playing with the method described in this paper. 
It is important to note that for some problems it is advisable to nor- 
malise the data ranges prior to the fitting computation in order to 
prevent (or at least reduce) numerical errors. BoundFit incorpo- 
rates this option, and the users should verify the benefit of applying 
such normalisation for their particular needs. 
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APPENDIX A: INTRODUCING ADDITIONAL 
CONSTRAINTS IN THE FITS 

Sometimes it is not only necessary to obtain a given functional fit 
to a data set, but to do so while imposing restrictions on some of the 
fitted parameters clq, Oi, . . . , a p . This can be done by introducing 
either equality or inequality constraints, or both. These constraints 
are normally expressed as 

Cj(a ,ai,...,a p ) =0 j = l,...,n e (Al) 
Cj(ao,ai,...,Op)>0 j = n e + 1, • • ■ ,n e + rn (A2) 

being n e and ri\ the number of equality and inequality constraints, 
respectively. In the case of some boundary determinations it may 
be useful to incorporate these type of constraints, for example when 
one needs the boundary fit to pass through some pre-defined fixed 
points, and/or to have definite derivatives at some points (allowing 
for a smooth connection between functions). 

Many techniques that allow to minimize cost functions while 
taking into account su pplementar y constraints are described in 
the literature (see e.g. IRaol Il978l: iGill. Murray & Wrightl [l989l: 



Bazaraa. Sherali & Shetty 1993; Nocedal & Wright 2006; Fletcher 



2007), and to explore them here in detail are beyond the aim of this 
work. However this appendix outlines two basic approaches that 
can be useful for some particular situations. 

Al Avoiding the constraints 

Before facing the minimisation of a constrained fit, it is advisable 
to check whether some simple transformations can help to convert 
the constrained optimisation prob l em in to an unconstrained one by 
making change of variables. IRaol Jl978h presents some useful ex- 
amples. For instance, a frequently encountered constraint is that in 
which a given parameter ai is restricted to lie within a given range, 
e.g. Qz.min < ai < ai jmax . In this case the simple transformation 



O-l — tSi.rain + (<Ji,max — «!,min) 



• 2 , 

sin 0( 



(A3) 



provides a new variable bi which can take any value. If the original 
parameter is restricted to satisfy ai > 0, the trivial transformations 
ai — abs(6;), ai = b\ , or ai — exp(fej) can be useful. 

Unfortunately, when the constraints are not simple functions, 
it is not easy to fi nd the required transformations. As highlighted 
bv lFletcheiH2007l) . the transformation procedure is not always free 
of risk, and in the case where it is not possible to eliminate all the 
constraints by maki ng chan ge of variable, it is better to avoid partial 
transformation dRaoll 19781) . 

An additional strategy that can be employed when handling 
equality constraints is trying to use the equations to eliminate some 
of the variables. For example, if for a given equality constraint Cj is 
possible to rearrange the expression to solve for one of the variables 



Cj =0 — > a s = gj(a ,ai,. . . ,a s -i,a s +i, 



(A4) 



then the cost function simplifies from a function in (p+1) variables 
into a function in p variables 
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f(ao, oi, . . . , a s -i, a s , a 3 +i • • • , a P ) = 
= f(ao, Oi, . . . , ffl a -i, Sj, a s +i • • • , Hp), 



(A5) 



since the dependence on a s is removed. When the considered prob- 
lem only has equality constraints and, in addition, for all of them 
it is possible to apply the above elimination, the fitting procedure 
transforms into a simpler unconstrained problem. 



A2 Facing the constraints 

The weighting scheme underlying the minimisation of Eq. {2} 
is actually an optimisation process based on the penalisation in 
the cost function of the data points that falls in the wrong side 
(i.e. outside) of the boundary to be fitted. For this reason it 
seems appropriate to employ ad ditional penalty functions (see e.g. 
Bazaraa, Sherali & Shetty 1993) to incorporate constraints into the 
fits. 

In the case of constraining the range of some of the parame- 
ters to be fitted, ai im m < a; < a;, max, it is trivial to adjust the value 
of the cost function by introducing a large factor A that clearly pe- 
nalises parameters beyond the required limits. In this sense, Eq. l[2j 
can be rewritten as 



/ = Ah(a , Oi, 



^ Wi\y(xi 



(A6) 



where h(ao , oi, . . . , a p ) is a function that is null when the required 
parameters are within the requested ranges (i.e., the fit is performed 
in an unconstrained way), and some positive large value for the 
contrary situation. 

For the particular case of equality constraints of the form given 
in Eq. dA It . it is possible to directly incorporate these constraints 
into the cost functions as 



N 

f = A^ \cj(a ,ai, ...,a p )\ a + w i\v( x i) ~ Vi\ 

l 



(A7) 



In this situation, for the constraints to have an impact in the cost 
function, the value of the penalisation factor A must be large 
enough to guarantee that the first summation in Eq. dA7t dominates 
over the second summation when a temporary solution implies a 
large value for any |c, |. 

As an example, Fig. I A 1 1 displays the upper boundary limit 
computed using adaptive splines for the same data previously em- 
ployed in Figs. [2] [4] and [6] but arbitrarily forcing the fit to pass 
through the two fixed points (0.05,100) and (0.20,100), marked in 
the figure with the green open circles. The constrained fit (thick 
continuos red line) has been determined by introducing the two 
equality constraints 



ci : y(x = 0.05) - 100 = 0, and 
c 2 : y(x = 0.20) - 100 = 0. 



(A8) 



The displayed fit was computed using a penalisation factor 
A = 10 B , with an asymmetry coefficient £ = 1000, iVknots = 15, 
-^Vmaxiter = 1000 iterations, iV re sne = 20 processes, a = 2, and 
P = 0. For comparison, another fit (dotted blue line) has also been 
computed by introducing two more constraints, namely forcing the 
derivatives to be zero at the same points, i.e., y'(x = 0.05) = and 
y (x = 0.20) = 0. The resulting fit is clearly different, highlight- 
ing the importance of the introduction of the constraints. 
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Figure Al. Example of constrained boundary fit, using adaptive splines 
with the same data employed in Figs. F2]|4] and [6] The boundary (red line) 
has been forced to pass through the points marked with open circles (green), 
namely (0.05,100) and (0.20,100). To give an important weight to the two 
constraints in Eq. 1A71 . the value of the penalisation factor has been set 
to A = 10 6 . The dotted blue line is the same fit, but introducing two new 
additional constraints, in particular forcing the derivatives to be zero at the 
same fixed points. 



APPENDIX B: NORMALISATION OF DATA RANGES TO 
REDUCE NUMERICAL ERRORS 

The appearance of numerical errors is one of the most important 
sources of problems when fitting functions, in particular polynomi- 
als, to any data set making use of a piece of software. The problems 
can be specially serious when handling large data sets, using high 
polynomial degrees, and employing different and large data ranges. 
Since the size of the data set is usually something that one does not 
want to modify, and the polynomial degree is also fixed by the na- 
ture of the data being modelled (furthermore in the case of cubic 
splines, where the polynomial degree is fixed), the easier way to re- 
duce the impact of numerical errors is to normalise the data ranges 
prior to the fitting procedure. However, although this normalisa- 
tion is a straightforward operation, the fitted coefficients cannot be 
directly employed to evaluate the sought function in the original 
data ranges. Previously it is necessary to properly transform those 
coefficients. This appendix provides the corresponding coefficient 
transformations for the case of the fitting to simple one-dimensional 
polynomials and to cubic splines. 



Bl Simple polynomials 

Simple polynomials are typically expressed as 



y = ao + aix + aix + ■ ■ ■ + a p x . 



(Bl) 



Let's consider that the ranges exhibited by the data in the corre- 
sponding coordinate axes are given by the intervals [xmin, x max ] and 
[?/min, 2/max], and assume that one wants to normalise the data within 
these intervals into new ones given by [x min , Smax] and [y mi „, y max ], 
through a point-to-point mapping from the original intervals into 
the new ones, 



^min, ^maxj , 



and 



[^min , ^maxj 
[^/min , 2/max] * [2/min , ?/max] 

For this purpose, linear transformations of the form 

x — b x x — c x and y = b y y — c y 



(B2) 
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are appropriate, where b and c are constants (b x and b y are scaling 
factors, and c x and c y represent origin offsets in the normalised 
data ranges). The inverse transformations will be given by 



x = — and y = " . . 



(B3) 



Assuming that the original and final intervals are not null (i.e., 

#min 7^ ^max) #min 7^ 5 max? 2/min ^ ?/max and ?/min 7^ 2/maxX it is trivial 

to show that the transformation constants are given by 



% max Xtj 



^ max nun 



(B4) 



(B5) 



and the analogue expressions for the coefficients of the y-axis trans- 
formation. For example, to perform all the arithmetical manipula- 
tions with small numbers, it is useful to choose x mm = y m i„ = — 1 
and i raax = y mlix = +1, which leads to 

2 



b x 



(B6) 



(B7) 



and the analogue expressions for b y and c y . 

Once the data have been properly normalised in both axes fol- 
lowing the transformations given in Eq. dB2t . it is possible to carry 
out the fitting procedure, which provides the resulting polynomial 
expressed in terms of the transformed data ranges as 



y = 2o + CLlX + CL2X 2 + ■ 



+ a p x p . 



(B8) 



At this point, the relevant question is how to transform the fitted 
coefficients So, ai, . . . , a p into the coefficients ao, a\, . . . , a p cor- 
responding to the same polynomial defined over the original data 
ranges. By substituting the relations given in Eq. (B21 in the previ- 
ous expression one directly obtains 

\2 



(b y y - cy) = a h 
+ ■• 

Remembering that 

m 

(b x x - c x ) m = ^ 



ai(b x x - Cx) + a 2 (b x x ■ 
+ a p (b x x — c x ) p . 



{i \m — n { \r 

(b x x) {—c x ) 



- + 



with the binomial coefficient computed as 

m I 



(B9) 



(BIO) 



(Bll) 



n\ (m — n)! ' 

and comparing the substitution of Eq. ( IB 1 Oi l and Eq. ( IB lit into 
Eq. ( |B9t with the expression given in Eq. JBU . it is not difficult to 
show that if one defines 



WC- 



ce 12) 



the sought coefficients will be given by 

h + Cy 



ai = 



for i = 



— with i = l, 



(B13) 



In the particular case in which c x 
plify to 



0, the above expressions sim- 




1 10 100 1000 

Iteration number 

Figure Bl. Variation in the fitted coefficients, as a function of the number 
of iterations, for the upper boundary fit (5th order polynomial) shown in 
Fig.|5Ji. This plot is the same than Fig. [3] but in this case analysing the im- 
pact of the normalisation of the data ranges prior to the boundary determina- 
tion. Each panel represents the coefficient value at a given iteration (ai, with 
i = 0, . . . , 5, from bottom to top) divided by a* , the final value derived af- 
ter Af max i, er = 2000 iterations. The same j/-axis range is employed in all 
the plots. The red line shows the results when applying the normalisation, 
and the blue line indicates the coefficient variations when this normalisation 
is not applied. In both cases £ = 1000, a = 2 and (3 = were used. Note 
that the plot x-scale is in logarithmic units. 



d + Cy 



by 



f or i = 



with i 



(B14) 



The normalisation of the data ranges has several advantages. 
Fig. IB II (similar to Fig. [3} shows the impact of data normalisation 
on the convergence properties of the fitted coefficients, as a func- 
tion of the number of iterations, for the upper boundary fit (5th 
order polynomial) shown in Fig. [2^. The red line, corresponding to 
the results when the normalisation is applied prior to the boundary 
fitting, indicates that after iVmaxiter ~ 140, the coefficients have con- 
verged. The situation is much worse when the normalisation is not 
applied, as illustrated by the blue line. In this case the convergence 
is only reached after iVmaxiter ~ 1450 iterations, ten times more than 
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Figure B2. Example of the appearance of numerical errors in the bound- 
ary fitting with simple polynomials. The fitted data set consists in 10000 
points randomly drawn from the function y = sin(1.5x)/(l + x) for 
x S [0, 2tt], assuming a Gaussian error cr = 0.02 in the y-axis, and where 
prior to the data fitting the (x, y) coordinates were transformed using 
Xflt = 1000 + 500 ^original and yst = 1000 {/original in order to artificially 
enlarge the data ranges. Panel (a): bootstrapped data and fitted boundaries. 
Panel (b): residuals relative to the original sinusoidal function. In both pan- 
els the lines indicate the resulting fits for different polynomial degrees and 
normalisation strategies (in all the cases £ = 1000, a = 2 and /3 = were 
employed). The continuous red lines are the boundaries obtained using 
polynomials of degree 10 and normalising the data ranges prior to the fitting 
procedure. The green and blue lines correspond to the fits obtained by fitting 
polynomials of degrees 9 and 8, respectively, without normalising the data 
ranges. Using the original data ranges the boundary fits start to depart from 
the expected location due to numerical errors for polynomials of degree 9. 
However polynomials of degree 10 are still an option when the data ranges 
are previously normalised. 



when using the normalisation. In addition, the ranges spanned by 
the coefficient values along the minimisation procedure are nar- 
rower when the data ranges have been previously normalised. 

Fig. IB2I exemplifies the appearance of numerical errors that 
takes place when increasing the polynomial degree during the fit- 
ting of a reasonably large data set. In this case 10000 points are 
fitted employing upper and lower boundaries with simple polyno- 
mials of degree 10 (red lines) after normalising the data ranges us- 
ing the coefficients given in Eqs. l TB6t and l(B7]l (with the analogue 
expressions for the j/-axis coefficients) prior to the numerical min- 
imisation. When the data ranges are not normalised, the fitting to 
polynomials of degree 10 gives non-sense results. Only polynomi- 
als of degree less or equal than 9 are computable. And for the case 
of degree 9 the results are unsatisfactory (green lines), being the 
polynomials of degree 8 (blue lines) the first reasonable boundaries 
while fitting the data preserving their original ranges. Thus in this 
particular example the normalisation of the data ranges allows to 
extend the fitted polynomial degree in two units. 



B2 Cubic splines 

Normalisation of the data ranges is also important for the computa- 
tion of cubic splines, in particular for the boundary fitting to adap- 
tive splines described in Section [3] In that section the functional 
form of a fit to set of iVinots was expressed as 



y = s 3 (k)[x - 

+ Sl(fc)[jJ - 



a;knot(fc)] 3 + s 2 (k)[x - 

^knol(fc)] + S (fc), 



£knot(fc)] 2 + 



(B15) 



where (x^ nol (k), yknoi(fc)) are the (x,y) coordinates of the 
fc" 1 knot, and so(fc), si(fc), S2(k), and S3(k) are the corre- 
sponding spline coefficients for x 6 [:Eknot(fc), a;knoi(fc + 1)], with 

k = 1, . . . , iVknots - 1. 

Using the same nomenclature previously employed for the 
case of simple polynomials, the result of a fit to cubic splines per- 
formed over normalised data ranges should be written as 



V = S3(k)[x - 
+ s 1 (k)[x - 



iknot(fc)] 3 + S2(k)[x - 
' Sknot(fc)] + S (fc). 



(k)} 2 + 



(B16) 



Following a similar reasoning to that used previously, it is straight- 
forward to see that the sought transformations are 



Si(fc) 



§ (k) + Cy 



§i(k)bi 



for i = 



with i = 1, 



(B17) 



,3 



where k = 1, . . . , JVknots — 1. Note that these transformations are 
identical to Eq. dB 14I ). This is not surprising considering that 
splines are polynomials and that the adopted functional form given 
in Eq. dB 1 5t is actually providing the y(x) coordinate as a function 
of the distance between the considered x and corresponding value 
^knoc(fc) for the nearest knot placed at the left side of x. Thus, the 
c x coefficient is not relevant here. 

B3 A word of caution 

Although the method described in this appendix can help in some 
circumstances to perform fits with larger data sets or higher poly- 
nomial degrees than without any normalisation of the data ranges, 
it is important to keep in mind that such normalisation does not al- 
ways produce the expected results and that numerical errors appear 
in any case sooner or later if one tries to use excessively large data 
sets or very high values for the polynomial degrees. 

Anyhow, the fact that the normalisation of the data ranges can 
facilitate the boundary determination of large data sets or to use 
higher polynomial degrees justifies the effort of checking whether 
such normalisation is of any help. Sometimes, to extend the poly- 
nomial degrees by even just a few units can be enough to solve the 
particular problem one is dealing with. The program BoundFit 
incorporates the normalisation of the data prior to the boundary fit- 
ting as an option. 



